Add respectNodePodLimits scheduler flag to enforce per-node pod capacity#4841
Conversation
Greptile SummaryThis PR adds a The implementation is well-structured with good test coverage across configuration, jobdb, nodedb matching, and an end-to-end preempting-scheduler integration test. Confidence Score: 5/5This PR is safe to merge; no P0 or P1 issues found. All changed code paths are correct: safeGetRequirements returns a fresh map so mutation is safe, ApplyRespectNodePodLimits is idempotent and called before ResourceListFactory construction, Clone correctly propagates the flag, and the executor's unconditional pods injection is silently dropped by the scheduler when the feature is off. Comprehensive test coverage spans unit, lifecycle, matching, and integration levels. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Scheduler startup] --> B{RespectNodePodLimits?}
B -- false --> C[Normal scheduling\nno pod-slot tracking]
B -- true --> D[ApplyRespectNodePodLimits\nregisters pods in\nSupportedResourceTypes\n+ IndexedResources]
D --> E[ResourceListFactory\ntracks pods dimension]
E --> F[JobDb.SetRespectNodePodLimits\ntrue]
subgraph Executor
G[Non-Armada pod scan] --> H[allocatedByPriorityAndResourceTypeFromPods\ninjects pods:1 per pod]
H --> I[NonArmadaAllocatedResources\nreported to scheduler]
end
subgraph Scheduler per-job
F --> J[JobDb.getResourceRequirements\ninjects pods:1 into job requirements]
J --> K[NodeDb capacity check\npods consumed on bind\nfreed on evict+unbind]
end
I --> L{Scheduler has pods\nin SupportedResourceTypes?}
L -- yes --> K
L -- no --> M[pods field silently dropped\nby ResourceListFactory]
Reviews (18): Last reviewed commit: "Merge branch 'master' into respect-node-..." | Re-trigger Greptile |
4faa4a9 to
57a9176
Compare
2008bd1 to
ef28b83
Compare
ef28b83 to
257b292
Compare
58fb73a to
4110640
Compare
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
4110640 to
70b5707
Compare
|
Thanks for landing this. We'll try it out at some point |
…n-preemptible (#4879) #### What type of PR is this? /kind bug /kind cleanup #### What this PR does / why we need it The scheduler can over-pack a node when non-preemptible jobs at a lower priority hold all of its resources and a higher-priority job shows up. The higher-priority job lands on the node anyway, putting it over its declared capacity. The same gap is there for cpu, memory, and pods. Three pieces are involved, each one defensible on its own: 1. `MarkAllocated(p, rs)` in `internaltypes/resource_list_map_util.go:67` only deducts allocatable from priorities `<= p`. From a higher-priority view, lower-priority resources look free, because the assumption is "I could just preempt them if I needed to." 2. The rebalance eviction phase (`preempting_queue_scheduler.go:118`) skips non-preemptible jobs. 3. The OversubscribedEvictor (`eviction.go:164`) is the safety net for exactly this situation. It does detect the negative allocatable, but it also refuses to evict non-preemptible jobs, so it ends up with nothing to do. For preemptible incumbents the assumption holds and eviction does its job. For non-preemptible ones the assumption is wrong and the over-pack stays. The fix lives in `nodedb/nodedb.go`. The bind/unbind/evict paths now compute a `priorityCutoffFor(job, scheduledPriority)`: preemptible jobs use their scheduled priority as the cutoff (existing behavior), non-preemptible jobs use a sentinel `nonPreemptibleCutoff = math.MaxInt32` so the existing `markAllocated`/`markAllocatable` helpers deduct (or release) at every real priority. Once `AllocatableByPriority` reflects what the node really has free, the matcher and the OversubscribedEvictor do the right thing without any further changes. The PR has two commits so the bug and the fix are easy to see separately: - `Reproducer:` adds `TestPreemptingQueueScheduler_NonPreemptibleOverPack`. It uses cpu rather than pods so the assertion is on the priority model itself. Run against this commit alone, the test fails. - `Deduct non-preemptible...` is the fix. The reproducer now passes and nothing else in the suite needed touching, except `TestEviction`, which had hardcoded expected values reflecting the pre-fix accounting at high priorities. Updated. #### Which issue(s) this PR fixes Fixes # #### Special notes for your reviewer A few things to flag: I found this while validating #4841 (the `respectNodePodLimits` flag). That PR works fine in the common cases (preemptible incumbents, free slots, gangs); it just surfaces this older scheduler-wide issue, which is what's being addressed here at its real root. Performance: each bind/unbind/evict for a non-preemptible job iterates all priority levels (about 7 in practice) instead of `<= p`. Invisible at scale. Behavior change for fair share: non-preemptible jobs now consume from higher-priority queues' "available" budget. If any workload was implicitly counting on the over-allocation, it would show up as different scheduling decisions. Happy to flag-gate the rollout if that's a concern. I ran the full test suite locally across `internal/scheduler/...`, `internal/executor/...`, `internal/server/...`, and `internal/scheduleringester`. Everything passes. --------- Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
Summary
scheduling.respectNodePodLimitsfeature flag (defaultfalse) that enables the scheduler to trackpodsas a resource and reject scheduling to nodes that have exhausted their pod limit (node.Status.Allocatable["pods"])podsinsupportedResourceTypesandindexedResourcesat startup, and injectspods: 1into every job's internal resource requirementsNonArmadaAllocatedResourcesso the scheduler can subtract system/DaemonSet pods from available capacityFixes #4515
This PR builds on top of #4517 and big thanks for the initial work to @Sovietaced
Operator upgrade notes
NonArmadaAllocatedResourcesgains apodskey in every report regardless of whether any scheduler has the flag enabled. Dashboards, metrics, or custom consumers that iterate this map generically (e.g. sum over all keys) will start including pod counts. Audit prometheus / Grafana panels before rollout.falsestops the scheduler from tracking pods; reverting the executor binary removes thepodskey from its reports. Neither requires data migration.FromNodeProtosilently drops unknown resources). New scheduler + old executor briefly overestimates free pod capacity by the count of non-Armada pods per node (~10-30 typically, DaemonSets + system pods), since the old executor does not report them. The overestimate resolves as executors are upgraded.Known limitations
podsis not added todominantResourceFairnessResourcesToConsider. On dense-pod nodes (e.g. GKE's 110-pod limit) a queue running many small pods can monopolize pod slots without a fair-share penalty. Deferred per reviewer request; follow-up if this becomes a problem in practice.